Photograph by Eric Gaddy with Casting Shadows Photography
Charlotte, North Carolina: home of NASCAR, barbeque, banking and … craft beer?
Over the last five years, a number of breweries have sprung up in Charlotte, and already it’s hard to keep up with the new names. Last summer, a Fortune article covered Charlotte’s growth in the craft beer industry, even calling Charlotte a hub for craft beer. While the same could be said for many other cities in the United States, in this tutorial we’re going to explore ways to analyze Charlotte’s craft beer growth with Twitter data.
In this three-part tutorial, we will explore beer-related Tweets from Charlotte between December 2015 and February 2016, based on data from Gnip’s Historical PowerTrack API. Tweets are identified as Charlotte Tweets by geo-location (GPS coordinates); therefore, this analysis excludes Tweets without geo-location.
The purpose of these posts is to serve as tutorials for UNC Charlotte social scientists on ways to analyze Twitter data. While I find some neat results, these posts are not meant to be definitive analyses of Charlotte’s beer industry, and since our data is limited to geo-located Tweets, we cannot generalize our results to all Tweets.
The data and funding for this tutorial are provided through UNC Charlotte’s Data Science Initiative in collaboration with Project Mosaic, the university’s social science research initiative.
For these tutorials, I’m using R, a widely popular open-source statistical and graphical computing environment. Throughout the tutorial, I’ll overlay R code blocks and output like this:
2 + 3
## [1] 5
If you’re new to R, don’t stress over each line of code. For brevity, I’m not going to explain every function or every detail of the R language. Most of the code shown is basic data manipulation, plus some pre-created functions I wrote to minimize the amount of code. For those interested in R data manipulation (data frames), here’s a great reference. Nevertheless, my secondary goal is to introduce R to new users and show how powerful a few lines of code can be!
The code and data can be found on my GitHub page. This tutorial was based on materials provided by Pablo Barberá. His references are phenomenal and I highly recommend his GitHub page for a ton of great tutorials and references.
Let’s load our Charlotte Beer Twitter dataset using the read.csv function.
beer.tweets <- read.csv("../datasets/CLT_beer_tweets.csv", encoding = "UTF-8",
stringsAsFactors = FALSE)
First, how many Tweets do we have?
nrow(beer.tweets)
## [1] 5151
What is the range of dates for the Tweets?
# first Tweet
min(beer.tweets$postedTime)
## [1] "2015-11-30 19:41:49 EST"
# last Tweet
max(beer.tweets$postedTime)
## [1] "2016-02-29 18:50:05 EST"
The first Tweet was on Nov 30, 2015 at 7:41pm (Eastern time) and the last Tweet was on Feb 29, 2016 at 6:50pm (Eastern time).
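Note that read.csv loads postedTime as plain character strings, so the min and max calls above compare text alphabetically, which happens to work for this "YYYY-MM-DD HH:MM:SS" layout. For genuine date arithmetic, here is a minimal sketch of a converter (toEastern is a helper name I made up; it is not part of functions.R):

```r
# Convert character timestamps to POSIXct for real date arithmetic.
# Assumes the "YYYY-MM-DD HH:MM:SS" layout shown above; any trailing
# time-zone text (e.g. "EST") is ignored by the parser.
toEastern <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "America/New_York")
}

# Example: length of the collection window in days
# difftime(toEastern(max(beer.tweets$postedTime)),
#          toEastern(min(beer.tweets$postedTime)), units = "days")
```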
And what are the columns (variables) in the dataset?
colnames(beer.tweets)
## [1] "body" "postedTime"
## [3] "actor.id" "displayName"
## [5] "summary" "friendsCount"
## [7] "followersCount" "statusesCount"
## [9] "actor.location" "generator"
## [11] "geo.type" "point_lat"
## [13] "point_long" "urls.0.expanded_url"
## [15] "klout_score" "hashtags"
## [17] "user_mention_ids" "user_mention_screen_names"
This dataset includes 18 variables. The data includes details about the Tweet and its author.
Here’s a brief metadata table that explains key variables that we’ll use in our analysis:
| Column | Description | Category |
|---|---|---|
| body | Tweet content/text | Tweet |
| postedTime | Time of Tweet | Tweet |
| displayName | Twitter Username | User |
| summary | Twitter profile description | User |
| friendsCount | User’s friends count | User |
| followersCount | User’s followers count | User |
| statusesCount | User’s Tweet count | User |
| actor.location | User’s self-reported location | User |
| generator | Mobile device or app used | Device |
| geo.type | Geo Type: Point or Polygon | Geo-location |
| point_lat | Point Latitude | Geo-location |
| point_long | Point Longitude | Geo-location |
Now, let’s load the pre-created functions in the functions.R file.
Much like loading a package with library, the source function runs the file so that its functions are available to us.
source('../functions.R')
First, let’s plot the daily count of Tweets along with a smoothing line, then we’ll ask some questions designed to help interpret the data. These interactive graphs are created using Plotly.
timePlotly(beer.tweets)
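timePlotly is one of the pre-created functions in functions.R. To give a rough idea of what goes on inside, here is a minimal sketch of a daily-count helper (dailyCounts is a name I made up; it assumes the character postedTime column begins with "YYYY-MM-DD"):

```r
library(dplyr)

# Count Tweets per calendar day from the character postedTime column.
dailyCounts <- function(df) {
  df %>%
    mutate(day = as.Date(substr(postedTime, 1, 10))) %>%
    group_by(day) %>%
    summarize(count = n())
}

# A Plotly line chart of the result might look like:
# plotly::plot_ly(dailyCounts(beer.tweets), x = ~day, y = ~count,
#                 type = "scatter", mode = "lines")
```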
What caused the spikes?
Why was there a drop in late December?
And why was turnout so weak around the January 22-23 weekend?
Are there other time patterns, such as daily or hourly cycles?
Next, let’s create a box plot by day of week and a line plot by hour to examine these patterns.
weekPlotly(beer.tweets)
This plot shows how Tweets vary by the day of the week.
The weekend makes a difference. On average, there were about 90 geo-located point Tweets on Saturdays. Fridays and Sundays average about 50-60 beer Tweets, but with some variability.
On the other hand, Mondays are the slowest beer day, averaging about 20 beer Tweets. This isn’t completely surprising, as a handful of bars and breweries are closed on Mondays. The middle of the week (Tuesday-Thursday) is flat at about 40-50 beer Tweets per day, with some variability on Wednesdays and Thursdays.
So no surprises: the weekends are beer time!
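The weekday averages above can also be reproduced without the plotting function. Here is a base-R sketch (avgByWeekday is a made-up helper name, again assuming a "YYYY-MM-DD" prefix in postedTime):

```r
# Average number of Tweets per day, grouped by day of week.
avgByWeekday <- function(df) {
  day <- as.Date(substr(df$postedTime, 1, 10))
  daily <- table(day)                       # Tweets per calendar day
  wd <- weekdays(as.Date(names(daily)))     # weekday name for each day
  tapply(as.integer(daily), wd, mean)       # mean daily count per weekday
}

# avgByWeekday(beer.tweets)
```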
What about an hourly plot?
hourPlotly(beer.tweets)
For this plot, we plot the Tweets by hour of the day (using a 24-hour clock). Try clicking one of the day names in the legend; this filters out the selected day.
Filter out Monday through Wednesday. On Saturdays, Tweets start coming in around 11am and keep coming through the night. On Sundays, Tweets start around noon but fall off around 8pm.
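The hour-of-day counts behind hourPlotly can be sketched in the same spirit (hourlyCounts is a made-up name; it assumes the hour sits at characters 12-13 of postedTime):

```r
# Count Tweets in each hour of the day (0-23), keeping empty hours.
hourlyCounts <- function(df) {
  hour <- as.integer(substr(df$postedTime, 12, 13))
  as.data.frame(table(hour = factor(hour, levels = 0:23)),
                stringsAsFactors = FALSE)
}

# hourlyCounts(beer.tweets)
```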
It’s clear that time is an important factor in the number of Tweets. Let’s now explore location as measured through GPS data: geo-location.
Twitter’s location data falls into three categories (“3 P’s”): points, polygons and profile locations.
These location types can be divided into two groups: geo-location and descriptive.
Geo-location data, which include points and polygons, are accompanied by GPS coordinate information. Descriptive data are text provided by users and associated with each user’s profile. The most common example is the profile location description. A user’s profile country code (e.g., US = United States) is also considered descriptive location data.
A key difference between geo-location data and descriptive data is that users must opt-in to provide geo-location data; descriptive data can be provided or removed at any time. Table 1 outlines the differences between each of the three types of Twitter location data.
| Types | Description | Geo-location? |
|---|---|---|
| Point | Activity location via GPS signal | Yes, Lat / Long Coordinates |
| Polygon (Place) | Mentioned place (e.g. city, neighborhood) | Yes, Lat / Long Bounding Box |
| Profile Location | Descriptive text, associated with profile | No, only descriptive text |
A point is a Tweet with coordinates from a GPS-enabled device. This location does not contain any contextual information unless combined with a Twitter polygon (place). Points have single values in the point_lat and point_long columns.
A polygon, sometimes referred to by Twitter and Gnip as a place (see this blog post), is a GPS polygon of four coordinates forming a “box” around a general area. Polygons have a display name (e.g., a city or country code) and an associated place ID with a range of attributes. Polygons do not have unique values in the point_lat and point_long columns; instead, they have values in bounding-box latitude/longitude fields (these fields have been omitted from our sample dataset). For details about the polygon (place) attributes, check out this webpage from Twitter’s API.
As a rule of thumb, you can think of points as specific geo-locations while polygons are broad geo-locations (like the name of a city).
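If the bounding-box fields had been kept in the export, one crude way to turn a polygon into a single plottable point is to average its four corners. Here is a sketch with made-up inputs (boxCenter is a hypothetical helper; Gnip’s actual field names differ and are omitted from our dataset):

```r
# Collapse a bounding box (four corner coordinates) to its center point.
# The input vectors here are hypothetical; our sample dataset omits the
# bounding-box fields entirely.
boxCenter <- function(lats, longs) {
  c(lat = mean(lats), long = mean(longs))
}

# A rough box around Charlotte:
# boxCenter(c(35.0, 35.0, 35.4, 35.4), c(-81.0, -80.6, -81.0, -80.6))
```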
Most Tweets do not have any geo-location information (point or polygon data)! This dataset excludes non-geo-located Charlotte beer Tweets. For example, if I tweet while standing in Charlotte but have geo-location disabled, my Tweet is not in this dataset.
For simplicity in this tutorial, we will not try to overcome this problem. However, this leaves our results with the caveat that they may not fully represent the population of all Charlotte beer-related Tweets. When using Twitter geo-located data, it is critical that researchers and practitioners be aware of the potential implications for their results (e.g., sample selection bias).
For a deeper dive into Twitter’s geo-location and its limitations, check out this great blog post.
Now back to our analysis. Let’s use the dplyr package to aggregate our data and count how many point versus polygon Tweets our dataset includes. This package aggregates relational data much like SQL’s GROUP BY clause and aggregate functions.
library(dplyr)
geo_type_cnt <- beer.tweets %>%
group_by(geo.type) %>%
summarize(count = n())
Let’s create a bar chart to compare the number of points and polygon Tweets.
plotlyBarChart(geo_type_cnt)
Most of the Tweets in our dataset are points (91%); only 9% are polygons. For reference, the original dataset of all Charlotte geo-located Tweets had roughly a 50/50 mix of points and polygons, so this dataset has a high ratio of points to polygons.
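Those percentages come straight from geo_type_cnt; a one-line mutate reproduces them (share is a column name I’m adding here, not part of the dataset):

```r
library(dplyr)

# Add each geo type's percentage of all Tweets to the count table.
withShares <- function(cnt) {
  cnt %>% mutate(share = round(100 * count / sum(count)))
}

# withShares(geo_type_cnt)
```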
Given their prevalence, let’s consider the points. If we are to plot them on a map, where are the Tweets located geographically?
beer.tweets.pt <- subset(beer.tweets, geo.type == "Point")
plots <- list()
for (i in c("Region", "Charlotte", "Uptown")) {
p1 <- heatmapPlot(beer.tweets.pt, zoom = i)
plots[[i]] <- p1 # add each plot into plot list
}
plots
## $Region
##
## $Charlotte
##
## $Uptown
(The three interactive heat maps render here, zoomed to the Region, Charlotte, and Uptown levels.)